Introduction

These are notes from current progress on the LimoRhyde → fGSEA pipe. Note that development is very much ongoing, and everything is subject to change.
Original file location for this document: /code/R/notes

Context

Since I haven’t formally written out the rationale behind this pipe, I’ll write it here. Previously, we have attempted several pipes. One is the circadian gene identification → pathway analysis using non-expression based tools link, another is differential analysis at every time point → pathway analysis using fGSEA link. The former pipe is well underway. The latter pipe has hit some serious snags - the outputs are interesting, but much work remains to be done before any conclusions can be drawn mostly because the outputs are not yet readable or interpretable in a meaningful way. This current notebook is a part of that analysis - I hope that, by using an analysis that combines differential rhythmicity and differential expression, we can better setup our pathway analysis.

LimoRhyde

LimoRhyde is reported to be a tool that can identify rhythmic, differentially rhythmic, and differentially expressed genes all in one package .

These notes

This notebook was me following the LimoRhyde vignette and applying it to our dataset. The tutorial can be viewed via browseVignettes('limorhyde').

Some of the major challenges included

  • the vignette worked with microarray data
  • the vignette used an eset dataset it imported, which I needed to generate from our dataset
  • the vignette was run on a 2-arm experimental setup, just a WT and KO, unlike ours which has 4 arms. It is unclear whether multiple arms is supported innately by LimoRhyde. It may also be possible to run the analysis multiple times to cover multiple analyses, unsure yet.

Generating eset

I took some time to generate an eset from our data. I did this because I thought this was required by LimoRhyde, and also because it seems like an appropriate formal method to store data, especially for later when the dataset becomes published.

I used the following resources to guide me through this:

The script where this was run is eset_conversion.R.

I modelled much of the eset structure off the LimoRhyde vignette eset. To view this data set, load the package and run this command.

eset = getGEO(filename = system.file('extdata', 'GSE34018_series_matrix.txt', package = 'limorhyde'))
View(eset)

File structure

The location for where the eset data is stored is /data/expressionset/

/data/expressionset/
│
├─eset.Rds              eset R dataset
├─metadata.xlsx         eset metadata in xlsx
├─metadata.csv          generated from metadata.xlsx
├─phenoData.xlsx        eset phenoData in xlsx
├─phenoData.csv         generated from phenoData.xlsx

Setup

Libraries imported.

library(Biobase)
library(magrittr)
library(dplyr)
library(DT)

dataDirectory = /code/R/notes/data/expressionset/

The feature counts data can be read in as is to eset@assayData.
This is enough to build a ‘minimal’ eset.

dataDirectory <- "../../../data/expressionset/"

exprs <- read.csv2("../../../KF_RNASeq_counts_filtered.txt", sep = ",", header = T, row.names = 1) %>%
  set_rownames(.$Geneid) %>%
  dplyr::select(-Geneid) %>%
  as.matrix()

minimalSet <- ExpressionSet(assayData = exprs)

phenoData and metadata are read in, and an AnnotatedDataFrame object is built using both.

pData <- read.table(paste0(dataDirectory, "phenoData.csv"), row.names = 1, header = T, sep = ",")

metadata <- read.table(paste0(dataDirectory, "metadata.csv"), row.names = 1, header = T, sep = ",")
## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
## EOF within quoted string
phenoData <- new("AnnotatedDataFrame",
                  data = pData, varMetadata = metadata)

Let’s take a look at the object structure.

phenoData
## An object of class 'AnnotatedDataFrame'
##   rowNames: KF_01 KF_02 ... KF_72 (72 total)
##   varLabels: sample_number title ... ZT_time (38 total)
##   varMetadata: labelDescription

We can view the phenoData.

phenoData@data %>%
  DT::datatable()

The metadata has the same number of rows as phenoData’s columns. It contains a description of each phenoData column name so people working with the data know what you’re referring to by, e.g., “KO” in a column.

phenoData@varMetadata %>%
  DT::datatable()

experimentData is simply information about the experiment itself.

experimentData <- new("MIAME",
                      name=c("Sumeed Yoyo Manzoor","Katya Frazier"),
                      lab="Chang Lab",
                      contact="smanzoor@uchicago.edu",
                      title="WT vs L-Bmal1-KO in SPF vs GF conditions RNA Seq data",
                      abstract="FILLER TEXT",
                      url="https://changlab.uchicago.edu ",
                      other=list(
                        notes="Created from excel sheets and text files"
                      ))

Finally, the formal eset is built using the above items.

eset <- ExpressionSet(assayData=exprs,
                      phenoData=phenoData,
                      experimentData=experimentData,
                      annotation="NA")

It can be saved in the data directory.

saveRDS(eset, paste0(dataDirectory, "eset.Rds"))

And loaded in future sessions.

test <- readRDS(paste0(dataDirectory, "eset.Rds"))

A brief look at the eset data.

eset
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 55573 features, 72 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: KF_01 KF_02 ... KF_72 (72 total)
##   varLabels: sample_number title ... ZT_time (38 total)
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation: NA

TODO

The eset is missing some data that we will need to fill in later:

  • eset@phenoData@data the following columns
    • extract_protocol RNA extraction protocol used
    • label_protocol RNA labelling protocol
    • hyb_protocol Hybridization protocol, probably not relevant to RNA Seq
    • scan_protocol Also probably not relevant to RNA Seq
    • data_processing Data processing to get featurecounts
    • platform_id Illumina platform
  • eset@protocolData
  • eset@featureData
  • eset@annotation Illumina platform

LimoRhyde

Setup

Libraries are imported.

library('annotate')
library('data.table')
library('foreach')
library('GEOquery')
library('ggplot2')
library('ggpubr')
library('knitr')
library('limma')
library('limorhyde')
library('org.Mm.eg.db')

Some functions were sourced from LimoRhyde.

source(system.file('extdata', 'vignette_functions.R', package = 'limorhyde'))

However, I didn’t end up using any of them.

Some parameters set. These values haven’t yet been reviewed, just set following the vignette’s instructions.

period = 24
qvalRhyCutoff = 0.15
qvalDrCutoff = 0.1

Load dataset

dataDirectory <- "../../../data/expressionset/"
eset = readRDS(paste0(dataDirectory, "eset.Rds"))

phenoData is read into sm, which is then selected only for sampple, condition, and Zeitgeber time.

sm = data.table(pData(phenoData(eset)))
sm = sm[, .(title, sample = sample_number, cond = genotype_microbe, time = ZT_time)]

setorderv(sm, c('cond', 'time'))

LimoRhyde then runs Fourier analysis of time to produce time_cos and time_sin.

sm = cbind(sm, limorhyde(sm$time, 'time_'))
sm %>%
  DT::datatable()

The vignette at this point obtains log-transformed expression values from eset@featureData. Instead, I just use featurecounts data eset@assayData$exprs.

emat = eset@assayData$exprs

Identify rhythmic genes

Using limma , rhythmic genes are obtained.

rhyLimma = foreach(condNow = unique(sm$cond), .combine = rbind) %do% {
  design = model.matrix(~ time_cos + time_sin, data = sm[cond == condNow])
  fit = lmFit(emat[, sm$cond == condNow], design)
  fit = eBayes(fit, trend = TRUE)
  rhyNow = data.table(topTable(fit, coef = 2:3, number = Inf), keep.rownames = TRUE)
  setnames(rhyNow, 'rn', 'geneId')
  rhyNow[, cond := condNow]
}

rhyLimmaSummary = rhyLimma[, .(P.Value = min(P.Value)), by = geneId]
rhyLimmaSummary[, adj.P.Val := p.adjust(P.Value, method = 'BH')]
setorderv(rhyLimmaSummary, 'adj.P.Val')

rhyLimmaSummary %>%
  DT::datatable()

Identify differentially rhythmic genes

All genes are used in statistical tests, but only results for rhythmic genes are kept.

design = model.matrix(~ cond * (time_cos + time_sin), data = sm)

fit = lmFit(emat, design)
fit = eBayes(fit, trend = TRUE)
drLimma = data.table(topTable(fit, coef = 5:6, number = Inf), keep.rownames = TRUE)
setnames(drLimma, 'rn', 'geneId')

drLimma = drLimma[geneId %in% rhyLimmaSummary[adj.P.Val <= qvalRhyCutoff]$geneId]
drLimma[, adj.P.Val := p.adjust(P.Value, method = 'BH')]
setorderv(drLimma, 'adj.P.Val')

drLimma %>%
  DT::datatable()

Identify differentially expressed genes

This is a linear model of cond alone.

design = model.matrix(~ cond + time_cos + time_sin, data = sm)

fit = lmFit(emat, design)
fit = eBayes(fit, trend = TRUE)
deLimma = data.table(topTable(fit, coef = 2, number = Inf), keep.rownames = TRUE)
setnames(deLimma, 'rn', 'geneId')

deLimma = deLimma[!(geneId %in% drLimma[adj.P.Val <= qvalDrCutoff]$geneId)]
deLimma[, adj.P.Val := p.adjust(P.Value, method = 'BH')]
setorderv(deLimma, 'adj.P.Val')

# deLimma$geneId = rownames(deLimma)
# rownames(deLimma) = NULL
# deLimma = anti_join(deLimma, filter(drLimma, adj.P.Val <= qvalDrCutoff), by = 'geneId')
# deLimma$adj.P.Val = p.adjust(deLimma$P.Value, method = 'BH')

deLimma %>%
  DT::datatable()

Plot the results

A volcano plot of \(\log_{2}\,\text{fold change}\) differentially expressed genes x \(\log_{10}\,q_{DE}\) (adj p val) shows many significant results with both positive and negative expression fold change.

ggplot(deLimma) +
  geom_point(aes(x = logFC, y = -log10(adj.P.Val)), size = 0.2, alpha = 0.5) +
  labs(x = expression(log[2]*' fold-change'), y = expression(-log[10]*' '*q[DE])) +
  geom_hline(yintercept = -log10(0.05)) +
  geom_text(aes(-700000, -log10(0.05), label = "adj p val = 0.05", vjust = -1, fontface = 3)) +
  theme_pubr()

Zooming in

trim_val <- 1000
deLimma %>%
  subset(.$logFC < trim_val) %>% 
  subset(.$logFC > -trim_val) %>%
  ggplot() +
    geom_point(aes(x = logFC, y = -log10(adj.P.Val)), size = 0.2, alpha = 0.5) +
    labs(x = expression(log[2]*' fold-change'), y = expression(-log[10]*' '*q[DE])) +
    geom_hline(yintercept = -log10(0.05)) +
    geom_text(aes(-1000, -log10(0.05), label = "adj p val = 0.05", vjust = -1, fontface = 3)) +
    theme_pubr()

The top rhythmic gene, top differentially rhythmic gene, and top differentially expressed gene is pulled.
I get gene names from ENSEMBL via an AnnotationDbi query see page 4.

geneIdsNow = c(rhyLimmaSummary$geneId[1], drLimma$geneId[1], deLimma$geneId[1])
geneSymbolsNow = AnnotationDbi::select(org.Mm.eg.db, keys=geneIdsNow, keytype = 'ENSEMBL', columns = c('SYMBOL')) %>% dplyr::select(SYMBOL) %>% unlist() %>% unname()

Some information on these genes.

AnnotationDbi::select(org.Mm.eg.db, keys=geneIdsNow, keytype = 'ENSEMBL', columns = c('SYMBOL','GENENAME')) %>%
  mutate(desc = c('top rhythmic gene', 'top differentially rhythmic gene', 'top differentially expressed gene')) %>%
  DT::datatable()

A plot of expression by time in each condition.

df = data.table(t(emat[geneIdsNow, ]))
setnames(df, geneSymbolsNow)
df[, sample := colnames(emat[geneIdsNow, ])]

df = merge(df, sm[, .(sample, cond, time)], by = 'sample')
df = melt(df, measure.vars = geneSymbolsNow, variable.name = 'geneSymbol',
          value.name = 'expr')
levels(df$geneSymbol) = geneSymbolsNow

ggplot(df) +
  facet_grid(geneSymbol ~ cond, scales = 'free_y') +
  geom_point(aes(x = time, y = expr, shape = cond, color = geneSymbol), size = 2) +
  labs(x = 'Zeitgeber time (h)', y = 'Expression (norm.)') +
  scale_shape(solid = FALSE, guide = FALSE) +
  scale_color_brewer(type = 'qual', palette = 'Dark2', guide = FALSE) +
  scale_x_continuous(limits = c(0, 24), breaks = seq(0, 24, 4))

Thus, these

Conclusions

LimoRhyde works, which is great. I can throw the data into it, and the package data works as expected. There are some issues that need to be addressed asap, however.

  • It is unclear how LimoRhyde is handling the condition differences I gave it. While I get Fourier analysis run on all 4 conditions, I only get one value for gene expression q val, for example. This may be related to the 4 conditions as covariates in a linear model, but in any case it is something that needs to be resolved if we want to study differences between all four conditions. It is hard to draw any conclusions when a single value is outputted for an input of 4 different conditions.
  • It is worth thinking deeply about what the analysis is actually doing to determine if this is worth using with our data. It may be helpful to look at the other examples they mention in the paper.

Session info

## R version 3.6.3 (2020-02-29)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] org.Mm.eg.db_3.8.2   limorhyde_0.1.1      limma_3.40.6        
##  [4] knitr_1.28           ggpubr_0.2.5         ggplot2_3.3.0       
##  [7] GEOquery_2.52.0      foreach_1.4.8        data.table_1.12.8   
## [10] annotate_1.62.0      XML_3.99-0.3         AnnotationDbi_1.46.1
## [13] IRanges_2.18.3       S4Vectors_0.22.1     DT_0.13             
## [16] dplyr_0.8.5          magrittr_1.5         Biobase_2.44.0      
## [19] BiocGenerics_0.30.0 
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.4.6       tidyr_1.0.2        assertthat_0.2.1   digest_0.6.25     
##  [5] R6_2.4.1           RSQLite_2.2.0      evaluate_0.14      pillar_1.4.3      
##  [9] rlang_0.4.5        blob_1.2.1         rmarkdown_2.1      labeling_0.3      
## [13] splines_3.6.3      readr_1.3.1        stringr_1.4.0      htmlwidgets_1.5.1 
## [17] RCurl_1.98-1.1     bit_1.1-15.2       munsell_0.5.0      compiler_3.6.3    
## [21] xfun_0.12          pkgconfig_2.0.3    htmltools_0.4.0    tidyselect_1.0.0  
## [25] tibble_2.1.3       codetools_0.2-16   crayon_1.3.4       withr_2.1.2       
## [29] bitops_1.0-6       grid_3.6.3         jsonlite_1.6.1     xtable_1.8-4      
## [33] gtable_0.3.0       lifecycle_0.2.0    DBI_1.1.0          scales_1.1.0      
## [37] stringi_1.4.6      farver_2.0.3       ggsignif_0.6.0     xml2_1.2.5        
## [41] vctrs_0.2.4        RColorBrewer_1.1-2 iterators_1.0.12   tools_3.6.3       
## [45] bit64_0.9-7        glue_1.3.2         purrr_0.3.3        hms_0.5.3         
## [49] crosstalk_1.1.0.1  yaml_2.2.1         colorspace_1.4-1   memoise_1.1.0
 

By Sumeed Yoyo Manzoor

smanzoor@uchicago.edu